SSH URL.File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git and paste the URL..Rmd file and replace “Your Name” with your name.The following paragraph and figure directly come from the website of Centers for Disease Control and Prevention (CDC):
COVID-19 Community Levels are a new tool to help communities decide what prevention steps to take based on the latest data. Levels can be low, medium, or high and are determined by looking at hospital beds being used, hospital admissions, and the total number of new COVID-19 cases in an area.
We will investigate how Covid-19 community levels have changed across US counties since the first data release on Feb 24, 2022. The data are weekly updated, and we use data slightly modified from the version updated on May 5, 2022. Click here if you are interested in raw data.
We first load relevant packages:
library(tidyverse)
library(sf)
We load Covid-19 community level data in a csv file. This is our typical “tidy” data frame.
covid <- read_csv("data/covid.csv")
covid
## # A tibble: 35,464 × 12
## county county_fips state county_populati… health_service_… health_service_…
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 Americ… 60000 Ameri… 47392 901 American Samoa
## 2 Guam 66000 Guam 168489 902 Guam
## 3 Common… 69000 Commo… 51851 903 Commonwealth of…
## 4 United… 78000 Unite… 106290 905 United States V…
## 5 Americ… 60000 Ameri… 47392 901 American Samoa
## 6 Guam 66000 Guam 168489 902 Guam
## 7 Common… 69000 Commo… 51851 903 Commonwealth of…
## 8 United… 78000 Unite… 106290 905 United States V…
## 9 Americ… 60000 Ameri… 47392 901 American Samoa
## 10 Guam 66000 Guam 168489 902 Guam
## # … with 35,454 more rows, and 6 more variables:
## # health_service_area_population <dbl>,
## # covid_inpatient_bed_utilization <dbl>,
## # covid_hospital_admissions_per_100k <dbl>, covid_cases_per_100k <dbl>,
## # covid_community_level <chr>, date <date>
We use filter and select to focus on the contiguous US and the following variables:
| variable name | description |
|---|---|
state |
State name |
county |
County name |
county_fips |
Federal Information Processing Standards (FIPS) five character county code |
county_population |
County population (2019 Census estimate) |
covid_community_level |
Covid-19 community level |
date |
Date of data release |
covid <- covid %>%
filter(!(state %in% c("United States Virgin Islands",
"Commonwealth of the Northern Mariana Islands",
"American Samoa",
"Puerto Rico",
"Alaska",
"Hawaii",
"Guam"))) %>%
select(state, county, county_fips, county_population,
covid_community_level, date)
In order to draw maps based on Covid-19 community level, we need another data set that has information on geometry of US counties. We read the shapefile us_counties.shp using st_read(). Its original shapefile (.shp) is downloadable from the US Census Bureau under “cb_2018_us_county_20m.zip” (resolution level 1:20,000,000).
The loaded data frame has the following variables:
| variable name | description |
|---|---|
statefips |
State FIPS |
countyfips |
County FIPS |
county |
County name |
geometry |
geometry features |
This is an sf object.
us_counties <- st_read("data/us_counties.shp", quiet = TRUE)
us_counties
## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## First 10 features:
## statefips countyfips county geometry
## 1 37 017 Bladen MULTIPOLYGON (((-78.902 34....
## 2 37 167 Stanly MULTIPOLYGON (((-80.49737 3...
## 3 39 153 Summit MULTIPOLYGON (((-81.68699 4...
## 4 42 113 Sullivan MULTIPOLYGON (((-76.81373 4...
## 5 48 459 Upshur MULTIPOLYGON (((-95.15274 3...
## 6 48 049 Brown MULTIPOLYGON (((-99.19587 3...
## 7 45 021 Cherokee MULTIPOLYGON (((-81.87441 3...
## 8 01 043 Cullman MULTIPOLYGON (((-87.11199 3...
## 9 54 023 Grant MULTIPOLYGON (((-79.48687 3...
## 10 46 055 Haakon MULTIPOLYGON (((-102.0011 4...
Q - What differences do you observe when comparing a typical tidy data frame to the new simple feature object?
sf and dplyrThe sf package plays nicely with our earlier data wrangling functions from dplyr.
filter()Compare county FIPS from two data sets by filtering for Durham county in NC. The state FIPS for NC is 37. Check variable types before filtering.
us_counties %>%
filter(county == "Durham", statefips == "37")
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -79.0163 ymin: 35.86321 xmax: -78.69932 ymax: 36.23932
## Geodetic CRS: NAD83
## statefips countyfips county geometry
## 1 37 063 Durham MULTIPOLYGON (((-79.0095 35...
covid %>%
filter(county == "Durham County", state == "North Carolina")
## # A tibble: 11 × 6
## state county county_fips county_populati… covid_community_… date
## <chr> <chr> <chr> <dbl> <chr> <date>
## 1 North Ca… Durham C… 37063 321488 High 2022-02-24
## 2 North Ca… Durham C… 37063 321488 Medium 2022-03-03
## 3 North Ca… Durham C… 37063 321488 Low 2022-03-10
## 4 North Ca… Durham C… 37063 321488 Low 2022-03-24
## 5 North Ca… Durham C… 37063 321488 Low 2022-03-17
## 6 North Ca… Durham C… 37063 321488 Low 2022-03-31
## 7 North Ca… Durham C… 37063 321488 Low 2022-04-07
## 8 North Ca… Durham C… 37063 321488 Low 2022-04-14
## 9 North Ca… Durham C… 37063 321488 Low 2022-04-21
## 10 North Ca… Durham C… 37063 321488 Low 2022-04-28
## 11 North Ca… Durham C… 37063 321488 Low 2022-05-05
Q - How are they different?
mutate()In us_counties data, we will create a new variable county_fips that exactly matches county_fips from covid data. Here we use paste0() that concatenates strings.
us_counties <- us_counties %>%
mutate(county_fips = paste0(statefips, countyfips))
us_counties$county_fips %>%
head(5)
## [1] "37017" "37167" "39153" "42113" "48459"
select()Select county and county_fips from us_counties.
us_counties %>%
select(county, county_fips)
## Simple feature collection with 3108 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## First 10 features:
## county county_fips geometry
## 1 Bladen 37017 MULTIPOLYGON (((-78.902 34....
## 2 Stanly 37167 MULTIPOLYGON (((-80.49737 3...
## 3 Summit 39153 MULTIPOLYGON (((-81.68699 4...
## 4 Sullivan 42113 MULTIPOLYGON (((-76.81373 4...
## 5 Upshur 48459 MULTIPOLYGON (((-95.15274 3...
## 6 Brown 48049 MULTIPOLYGON (((-99.19587 3...
## 7 Cherokee 45021 MULTIPOLYGON (((-81.87441 3...
## 8 Cullman 01043 MULTIPOLYGON (((-87.11199 3...
## 9 Grant 54023 MULTIPOLYGON (((-79.48687 3...
## 10 Haakon 46055 MULTIPOLYGON (((-102.0011 4...
Notice that geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry. Manipulating spatial data objects is similar, but not identical to manipulating data frames.
us_counties %>%
select(county, county_fips) %>%
st_drop_geometry() %>%
slice(1:5)
## county county_fips
## 1 Bladen 37017
## 2 Stanly 37167
## 3 Summit 39153
## 4 Sullivan 42113
## 5 Upshur 48459
something_join()Create a new sf object covid_sf by joining covid and us_counties keeping all rows in covid. Note that two variable names are common, namely, county_fips and county, in both data frames.
Q - Which variable should we use to join the data frames by? Can we use any? Why or why not?
covid_sf <- covid %>%
left_join(us_counties, by = "county_fips") %>% # tbl_df/ tbl/ data.frame, no sf!
st_as_sf()
covid_sf
## Simple feature collection with 34188 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
## # A tibble: 34,188 × 10
## state county.x county_fips county_populati… covid_community_… date
## <chr> <chr> <chr> <dbl> <chr> <date>
## 1 Alabama Geneva Cou… 01061 26271 High 2022-02-24
## 2 Alabama Greene Cou… 01063 8111 Medium 2022-02-24
## 3 Alabama Hale County 01065 14651 Medium 2022-02-24
## 4 Alabama Henry Coun… 01067 17205 High 2022-02-24
## 5 Alabama Houston Co… 01069 105882 High 2022-02-24
## 6 Alabama Jackson Co… 01071 51626 High 2022-02-24
## 7 Alabama Jefferson … 01073 658573 High 2022-02-24
## 8 Alabama Lamar Coun… 01075 13805 Medium 2022-02-24
## 9 Alabama Lauderdale… 01077 92729 Medium 2022-02-24
## 10 Alabama Lawrence C… 01079 32924 Low 2022-02-24
## # … with 34,178 more rows, and 4 more variables: statefips <chr>,
## # countyfips <chr>, county.y <chr>, geometry <MULTIPOLYGON [°]>
Q - Check what happens to the common variable that is not used in joining.
group_by(), summarize()Using covid, let’s compute the total population across counties in different Covid-19 community levels based on the latest data (date == "2022-05-05").
covid %>%
filter(date == "2022-05-05") %>%
group_by(covid_community_level) %>%
summarize(tot_pop = sum(county_population))
## # A tibble: 3 × 2
## covid_community_level tot_pop
## <chr> <dbl>
## 1 High 13272001
## 2 Low 251130313
## 3 Medium 61689792
sf and ggplotWe will examine how the Covid-19 community levels change over space and over time. As usual, we can build up a visualization layer-by-layer beginning with ggplot. Let’s start by making a basic plot of all US counties.
covid_sf %>%
ggplot() +
geom_sf() +
labs(title = "US counties")
Let’s try NC.
covid_sf %>%
filter(state == "North Carolina") %>%
ggplot() +
geom_sf() +
labs(title = "NC counties")
Now adjust the theme with theme_bw().
covid_sf %>%
filter(state == "North Carolina") %>%
ggplot() +
geom_sf() +
labs(title = "NC counties") +
theme_bw() # white canvas
If you liked the theme, we can globally fix it as a default theme instead of attaching that layer for every ggplot.
theme_set(theme_bw())
We now fill each county by the latest value of Covid-19 community level (date == "2022-05-05"). To match colors used in CDC, we use scale_fill_manual().
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level)) +
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Now adjust color in geom_sf to change the color of the county borders.
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
color = NA) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Adjust the width of the county borders using size.
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
size = 0) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Finally, adjust the transparency using alpha by mapping
covid_sf %>%
filter(date == "2022-05-05") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level), # mapping
alpha = 1) + # setting
labs(title = "Covid-19 community level across US counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Q - What is your final choice of color, size, and alpha?
Q - Create the same plot for NC counties only. Different choices of color, size, and alpha might be more appealing in this case.
covid_sf %>%
filter(date == "2022-05-05",
state == "North Carolina") %>%
ggplot() +
geom_sf(aes(fill = covid_community_level)) +
labs(title = "Covid-19 community level across NC counties as of May 5, 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99"))
Q - What is the county that has a different community level than the rest of the counties in NC?
We can visualize temporal trends of Covid-19 community levels by faceting.
Q - Create a faceted plot of US counties over all available dates where each county is filled with Covid-19 community level. Briefly explain the spatial and temporal trends.
Q - Repeat the previous exercise for states of your choosing.
covid_sf %>%
filter(state %in% c("New York", "Vermont", "New Hampshire",
"Maine", "Massachusetts", "Connecticut",
"Rhode Island", "New Jersey")) %>%
ggplot() +
geom_sf(aes(fill = covid_community_level)) +
facet_wrap(~ date) +
labs(title = "Covid-19 community level in Northeastern US since Feb 2022",
fill = "Covid-19 community level") +
scale_fill_manual(values = c("High" = "#fb8b5b",
"Medium" = "#f9f99d",
"Low" = "#00cc99")) +
theme(legend.position = "bottom")
Write a brief research question that you could answer with this dataset and then investigate it here.
What are limitations of your visualizations above?